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V^". TESTING PREDICTOR CONTRIBUTIONS IN SUFFICIENT 

O ■ DIMENSION REDUCTION 

o' 

^ ! By R. Dennis Cook^ 

^ ■ University of Minnesota 

. -^ We develop tests of the hypothesis of no effect for selected predic- 

-sj , tors in regression, without assuming a model for the conditional dis- 

tribution of the response given the predictors. Predictor effects need 
not be limited to the mean function and smoothing is not required. 
pH , The general approach is based on sufficient dimension reduction, the 

^0 ' idea being to replace the predictor vector with a lower-dimensional 

(-H , version without loss of information on the regression. Methodology 

■4-J ■ using sliced inverse regression is developed in detail. 

^' 

1. Introduction. In full generality, the goal of a regression is to infer 

about the conditional distribution of the univariate response variable Y given 
"^ ■ the p X 1 vector of predictors X: How does the conditional distribution 

i^ I of y|X change with the value assumed by X? Many different statistical 

^N ■ contexts have been developed to address this issue. In this article we consider 

sufficient dimension reduction (SDR), the basic idea being to replace the 
(^ I predictor vector with its projection onto a subspace of the predictor space 

^f ■ without loss of information on y|X. More formally, we seek subspaces S of 

the predictor space with the property that 
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(1) YJLX\PsX, 



where _1L indicates independence, Pf.\ stands for a projection operator in 
^ ! the standard inner product and, for future reference, Q(.) = Ip — P(.). The 

statement is thus that Y is independent of X given any value for P5X. Sub- 
spaces with this property are called dimension reduction subspaces. Letting 
P^ \ k = dim(5), a regression inquiry can then be limited to k <p new predictors, 

expressed as linear combinations of the original ones: vJ^X, . . . , v^X, where 
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2 R. D. COOK 

the basis {vi, . . . , v^} for S is often chosen so that the new predictors are 
uncor related. 

When the intersection of all subspaces satisfying (1) also satisfies (1) it is 
called the central subspace (CS) [Cook (1994, 1996, 1998a)] and is denoted 
by ^yix- The central subspace, which is assumed to exist throughout this 
article, is a population metaparameter that can be taken as the parsimo- 
nious target of a dimension reduction inquiry. Its dimension d = dim(5y |x) is 
called the structural dimension of the regression. There are several methods 
available that can be used to estimate the CS, including sliced inverse re- 
gression (SIR) [Li (1991)], sliced average variance estimation (SAVE) [Cook 
and Weisberg (1991)], graphical regression [Cook (1994, 1998a)], parametric 
inverse regression [Bura and Cook (2001b)] and partial SIR [Chiaromonte, 
Cook and Li (2002)] when categorical predictors are present. Cook and Weis- 
berg (1999a) gave an introductory account of studying regressions via central 
subspaces. 

Other dimension reduction methods estimate the central mean subspace 
[Cook and Li (2002)], which is a subspace of the CS that captures the mean 
function. These include ordinary least squares (OLS) and related methods 
based on convex objective functions, principal Hessian directions [Li (1992) 
and Cook (1998b)], iterative Hessian transformation [Cook and Li (2002)] 
and minimum average variance estimation [Xia, Tang, Li and Zhu (2002)]. 
In this article we are concerned only with the CS. 

The estimation methods for the CS mentioned previously are all consis- 
tent under reasonable conditions when the dimension d of the CS is known. 
Inference on d is often based on hypothesis testing: Starting with m = 0, 
test the hypothesis d = m versus d> m. If the test is rejected, increment m 
by 1 and test again, stopping with the first nonsignificant result. This type 
of procedure is fairly common for estimating the dimension of a subspace 
[see, e.g., Rao (1965), page 472]. Once an estimate d is obtained, subse- 
quent analysis, including choice of a first model, is typically guided by a 
summary plot of Y versus the new predictors t)^ X, . . . , t) ? X, where t)^ E W 
and {fji, . . . ,f]^} is the estimated basis for 5y|x- Examples of this process 
are available throughout the SDR literature. For recent examples, see Chen 
and Li (1998), Cook and Lee (1999) and Chiaromonte, Cook and Li (2002). 

The ability to test the significance of subsets of predictors is often im- 
portant in model-based regression, but is currently unavailable in SDR. In 
this article we develop tests of hypotheses involving statements of the form 
-Pw5y|x = Op, where TC is a user-selected subspace of the predictor space 
that specifies the hypothesis, and Op indicates the origin in K^. Partitioning 
X'^ = (X-i , Xg"), we imagine a typical application to test the hypothesis that 
r selected predictors X2 do not contribute to the regression. Let the columns 
of the px d matrix 77 be a basis for Sy\x. ^iid partition 77^ = {rji, 772") accord- 
ing to the partition of X. By definition of 5y|x, ^-IL XJTy X. We wish a test 
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of the hypothesis Y A. X.\r]J'Ki so that the coordinate subspace Span(r72) co- 
incides with the origin, Span(T72) = C'r- This can be expressed in terms of the 
statement P'H'Sy\x. = ^p by choosing H = Span((0,/j.)-^) to be the subspace 
of M^ corresponding to the coordinates X2 in question. Because we expect 7i 
win typicahy be chosen to target selected predictors, we refer to hypotheses 
of the form P-wSyix = ^p ^^ coordinate hypotheses, although 7i need not 
correspond to a subset of predictors (coordinates). We let r = dim('H). 

The following proposition gives a conditional independence interpretation 
of the statement Pn<SY\x = ^p- Its proof is sketched in the Appendix. 

Proposition 1. PhSy\x = Cp if o^nd only ifYJLP-H^lQn^- 

Consequently, a coordinate hypothesis test can be regarded as a test of 
the hypothesis that, given Q^X, the orthogonal part P^X of the predictor 
vector contains no information about the response. With TC = Span((0, Ir)'^) 
the hypothesis P'h<Sy\x = C'p is equivalent to the hypothesis that Y and X2 
are conditionally independent given Xi, Y JL X2IX1. 

In this article we consider three kinds of hypotheses that could be useful 
depending on the application-specific requirements: 

1. Marginal dimension hypotheses — d = m versus d> m; 

2. Marginal coordinate hypotheses — P-j-cSyix = ^p versus P'H'Sy\x / ^p] 

3. Conditional coordinate hypotheses — Ph'Sy\x = ^p versus Ph<Sy\x ¥" ^p 
given d. 

Marginal dimension hypotheses are considered extensively in the literature 
and are mentioned here for completeness. The other two forms are new and 
tests for them are developed in this article. Any of the dimension reduction 
methods mentioned previously (e.g., SIR, SAVE or PIR) could in principle 
be a foundation for tests of these hypotheses. In effect, graphical regression 
[Cook (1994, 1998a)] is built on our ability to assess coordinate hypotheses 
in a series of three-dimensional plots. In this article we use SIR to develop 
formal asymptotic tests of the two new hypotheses. 

Our use of SIR to develop tests of hypotheses involving coordinate restric- 
tions depends on rederiving it as the solution to a multivariate nonlinear least 
squares problem. This is done in Section 3.1 following further discussion of 
preliminary issues in Section 2. The population structure of SIR is related to 
the coordinate hypotheses in Section 3.2, and general results on test statis- 
tic construction are described in Section 4. In Sections 5 and 6 we develop 
the tests for the marginal and conditional coordinate hypotheses, including 
asymptotic null distributions and suggestions for implementation. Simula- 
tion results on level and power along with an illustrative data analysis are 
reported in Section 7. Concluding comments are given in Section 8, along 
with additional discussion of the literature and its relation to this work. To 
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avoid interrupting the discussion, proofs for most results are given in the 
Appendix. 

2. Preparations. We assume throughout this article that the data (l^,Xj) 
i = 1, . . . ,n, X G M^, are i.i.d. observations on (y, X), which has a joint distri- 
bution with finite fourth moments and S = Var(X) > 0. In keeping with the 
usual SIR protocol, we assume also that the response has been discretized by 
constructing h slices so that Y takes values in {1,2, . . . ,/i}. The jth value 
of Y is called the jth slice. This slicing step might be unnecessary if the 
response is naturally discrete or categorical. 

Let the standardized predictors be denoted by 



with sample version 



Z = I]~i/2(X-E(X)) 



" -1/2 



where subscript {yj) indicates observation j in slice y, y = 1, . . . ,h, j = 
1, . . . , Uy, n = J2y ny, X = J2yj ^yj/^ is the sample mean of the X^j's, 

. h riy 

^ = -EE(Xj;.-x)(x,,-xf 

is the usual sample covariance matrix and I]~ ' denotes the unique sym- 
metric positive-definite square root of S^^. To allow use of the usual inner 
product in subsequent developments and without loss of generality, we work 
in the Z scale with central subspace Sy\z = ^^ '5y|x [Cook (1998a), Propo- 
sition 6.3], letting the columns of the p x d matrix 7 be an orthonormal 
basis for Sy\z- Summations J2yj with implicit limits (yj) are always over 
y = l,...,h, j = l,...,ny. 

In practice coordinate hypotheses will typically be formulated in the orig- 
inal X scale by selecting an appropriate basis for 7i. A coordinate hypothesis 
could then be stated as cx.^r] = 0, where a^ is the user-selected basis for 7i 
expressed as a p x r matrix of full column rank r, and 77 is a basis for 5y|x- 
For example, to test if a selected subset of r predictors contributes to the 
regression we can test if the rows of r/ corresponding to the r predictors in 
question are all zero vectors. The matrix a^ can then be chosen to select 
the appropriate rows of rj. 

The hypothesis ct^Tj = holds if and only if a (Xl ' 77) = 0, where a = 
I]~ ' ctx and the columns of XI ' 77 form a basis for Syiz- A coordinate 
hypothesis in the X scale, P-wSyix = ^p with 7i = Span(a^), can be restated 
in the Z scale as Pn<SY\z = ^p with TC = Span(Q:). Thus by appropriate 
choice of basis, ctx or a, we can work in either scale. 
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Back to the Z scale, without loss of generality we take the columns of 

(2) a = 5]-V2^^.(^T5.-i^^)-i/2 

to be an orthonormal basis for 7i in the remainder of this article. The hy- 
pothesis P-hSyiz = C'p holds if and only if Syiz is in the orthogonal comple- 
ment of TC and consequently under the hypothesis we must have r <p — d. 
Otherwise the hypothesis is certainly false. 

3. SIR. 

3.1. Nonlinear least squares formulation. The development of SIR as a 
means to estimate Syiz requires the following condition: 

(CI) Linearity condition — E(Z|P5y|2.Z) = P^^.^Z. 

This condition, which is common in SDR, is equivalent to requiring that 
E(Z|7^Z) be a linear function of 7^Z [Cook (1998a), Proposition 4.2]. Li's 
(1991) design condition is equivalent to (CI), which applies to the marginal 
distribution of the predictors and not to the conditional distribution of y|Z 
as is common in regression modeling. Consequently, we are free to use exper- 
imental design, one-to-one predictor transformations r or reweighting [Cook 
and Nachtsheim (1994)] to induce the condition when necessary without suf- 
fering complications when inferring about Y\Z. Since we are not assuming 
a model for y|X, these adaptation methods need not change the fundamen- 
tal issues in the regression. For example, because y|(X = x) has the same 
distribution as y|(r(X) = r(x)), predictor transformations just change the 
way in which the conditional distribution of y|X is indexed. The linearity 
condition holds for elliptically contoured predictors. Additionally, Hall and 
Li (1993) showed that as p increases with d fixed the linearity condition 
holds to a reasonable approximation in many problems. 

The linearity condition implies that the conditional means E(Z|y) lie in 
the CS for all values of Y [Li (1991)]. We take this a step further and assume 
the following condition: 

(C2) Coverage condition — Span{E(Z|y = y)\y = 1,. . . ,h} = Sy\z, 

so that the subspace spanned by the inverse conditional means coincides 
with the CS. This condition is also common in regression studies based on 
SIR. It requires in part that h>d+l. For subsequent tests on d we require 
h>d + l. 

For each value y of y we can now find a vector p G M.'^ such that 

E(Z|y = y)=7p^, 

where 7 is the basis matrix for Syiz defined previously. Because E(Z) = 
we must have E(py) = J2y fyPy — 0' where fy = Pr(y = y) is the probability 
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of slice y. This suggests that for fixed d estimates of 7 and p„ can be 
constructed by minimizing the least squares loss function 

h ny 

y=ii=i 
over B in the Stiefel manifold [Muirhead (1982), page 67] of all p x d semi- 
orthogonal matrices and over Cy G W^ subject to J2y fy^y — 0) where / = 
Uy/n is the observed fraction of observations falling in slice y. The values of 
B and Cy that minimize L^ are then taken as the estimates 'y and Py of 7 
and Py, y = 1 . . . ,h. Although we refer to 7 and Py as estimates, it may be, 
strictly speaking, more appropriate to think of them as solutions since they 
can be replaced by -yli^ and Hp , where H is any orthogonal matrix. 
Minimizing Ld results in the SIR estimate of Syiz when d is regarded as 

known: Let Zy = J2j ^yjl'^y t'e the average of the Zyj in slice y, and write 
Lrf as 

La!(B,Cj^) = 2^||Zj^j - Zyll +2^||Zy -BCyll . 
yj yj 

For fixed B the minimum is attained by 

Cy = B^Zy, y = l,...,h. 
Then minimizing Lrf(B, Cy) over B yields the SIR estimate of Syiz- To sum- 
marize the essential result, let M = J2y fy^y^y denote the sample covariance 
matrix of the slice means, and let Ai > • • ■ > Ap denote the eigenvalues of 
M. Then the columns of 7 are the eigenvectors corresponding to the first d 
eigenvalues of M, and py = l Zy, y = 1,. . . ,h. 

The minimum value L^ = Lfi{'y,Py)-, which we call the residual sum of 
squares, is 

h rih _ p 

(3) L, = ^^||Z,,-Z,f + n Y. A, 

y=lj=l j=d+l 

for d<p — 1 and 

h rih 

(4) L, = ^5:i|Z,,-Z,f 

for d = p. 

The usual SIR test statistic Tn{m) for testing d = m versus d> m, where 
m <p, can be found by comparing the residual sum of squares under the 
null hypothesis to that under the alternative, 

p 

(5) Tn{m) = Lm - Lp = n ^ Xj. 

j=m.+l 
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Assuming that X has a multivariate normal distribution and implicitly as- 
suming the coverage condition, Li (1991) proved that the distribution of 
Tn{d) is asymptotically chi-squared with (p — d){h — d— 1) degrees of free- 
dom. Bura and Cook (2001a) proved that Tn{d) has the same asymptotic 
distribution under the coverage and linearity conditions plus the following 
condition: 

(C3) Constant covariance condition — Var(Z|P5y Z) = Q^y.^., 

where Qsyiz = Ip ~ Psyiz- This condition is equivalent to requiring that 
Var(Z|P5y Z) be a nonrandom matrix. Normality of X implies the lin- 
earity and constant covariance conditions, but not the coverage condition. 
Bura and Cook (2001a) also proved that in general Tn{d) is distributed as a 
weighted sum of independent chi-squared random variables and showed how 
to construct consistent estimates of the weights for use in practice. 

In the next section we relate the coordinate hypothesis P-hSyiz = ^p to 
the population structure of SIR. 

3.2. Coordinate hypotheses and SIR. Let Qy = \pfy^ let /x be the p x h 
matrix with columns gyFi{7i\Y = y), y = 1, . . . ,h, and construct the singular 
value decomposition 

«=) -(-■ -«)("■ o)(:;:[ 

where T = (Ti,Tq) and ^ = (^i, ^o) are px p and hx h orthogonal matri- 
ces, Ds is a dx d diagonal matrix of positive singular values and the various 
submatrices have the following dimensions: 

Ti:p X d, Tq-.p X p — d, ^I'.hxd, ^Q-.hxh — d. 

Under the linearity and coverage conditions, Span(ri) = Syiz and so under 
these conditions we can take "f = Ti as our basis for Syiz- 

The following two propositions relate coordinate hypotheses to the pop- 
ulation structure of SIR. The proofs seem straightforward and are omitted. 



Proposition 2. Assume that the linearity and coverage conditions hold. 
Then each of the following two conditions is equivalent to the coordinate 
hypothesis P'hSy\z = Op-' 

(i) QhTi=Ti. 
(ii) ?t:cSpan(ro). 

In addition, the coordinate hypothesis implies the following: 

(iii) Q„ro = ro(r^gwro;. 
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(iv) Ff{ = Fg Qt-iTq is a {p — d) X [p — d) symmetric idem,potent m,atrix 
of rank p — d — r. 

(v) Gt-i = I(p-d) ~ ^H = Tq P-^Fq is a {p — d) X (p— d) symmetric idem- 
potent matrix of rank r . 

Proposition 3. Assume that the linearity and coverage conditions hold. 
If PwSyiz = C'p, then the singular value decomposition of ^i is the same as 
that of Q-HfJ-- 

4. Test statistic construction. In this section we discuss results tliat will 
facilitate construction of statistics for testing the two new hypotheses de- 
scribed in Section 1. Proceeding by analogy with the nonlinear least squares 
derivation of SIR described in Section 3.1, the test statistics will be con- 
structed as the difference between the residual sums of squares under null 
and alternative hypotheses. The residual sum of squares under a dimension 
hypothesis d = m can be written using (3)-(5) as 

(7) L„ = ^f;||Z,,-Z,||2+r„(m) 

y=lj=l 

for m = 0,...,p. Here we define T„(p) = so that (3) and (4) are both 
covered by (7). 

We will also need the residual sum of squares under a coordinate con- 
straint P'H'Sy\z = ^p ^-iid a dimension constraint d = m. Because S is typi- 
cally unknown, it will have to be estimated for use in practice. Thus we let 

W = Span(S ax). 

To construct the residual sum of squares under coordinate and dimension 
constraints, write 

yj yj 

yj 

Because B represents an orthonormal basis for Syiz: we impose the con- 
straint -PgB = 0, thus reducing Lm to 

-^m(B,Cy) = 2_]||Zyj -Zyll +22\\^n'^y\\ 

(8) y^ y^ 



yj 



+ ^||Q^(Z,-BC, 



where the prime on L'^ indicates the imposition of the coordinate constraint. 
For fixed B with B B = B Q^B = /„, the minimum is attained by C„ = 



'W 
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B'^Q^Zy, y = 1,. . . ,h. Consequently, with irKp — r, 



h n. 



y 



.^if. E E 11^7^(2. - BC,)f = mm^ ||Q^(Z, - BB^Q^Z,)f 
' ^ y=^j=i yj 

p p—r 

= n J2 ^j='^ E ^i' 

j=m+l j=m+l 

where A^ > • • • > Ap are the eigenvalues of Q^NLQj:.. The last equality follows 
since the last r eigenvalues of Qr^MQ^ are all 0. li m>p — r, then 

,^i?,EllQ7?(Z.-BC,)f = 0. 

" yj 

Substituting into L^(B,C.y) given in (8) we obtain the residual sum of 
squares 

(9) L'^ = Y,\\Zyj -Zyf + J2\\P^Zyf + T:^{m), 

yj yj 

where T^im) = nYTj=m+i ^j ^^^ we adopt the convention that T^{p) = 0. 

In the next two sections we use (7) and (9) to construct test statistics for 
the new hypotheses introduced in Section 1. 

5. Marginal coordinate hypotheses. The marginal coordinate hypothesis 
P'hSy\t. = Op versus P'hSy\x / Op can be used to test the contributions of 
selected predictors without requiring a statement concerning the dimension 
of 5y|z. The test statistic TniTi) is the difference between the residual sums 
of squares under the null and alternative hypotheses: 

(10) Tn{n) = L'p-Lp = ntrace(P^MP^) 

(11) =||V^vec(a^ZOf, 

where vec is the usual operator that maps a matrix into a vector by stacking 

its columns, Q; = XI ax{a^'^ q^;)"^'^ is an orthonormal basis for ?Y and 

Z„ is the p X h matrix with columns g-yZy so that M = Ijn'^n ^^^^ ^n -^ M- 
The representation of T„('H) given by (10) is what might be expected based 
on intuition: to test if PhSy\z = Op we consider the size of the projection 

of M onto the subspace specified by the hypothesis. Before using (11) to 
describe the asymptotic distribution of TniTi) we consider another form of 
the statistic that might provide additional insights. 
Because E(Z|y)EcSy|z, 

Uy = S-i(E(X|y = y)- E(X)) G 5y,x, y = l,...,h. 
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Consequently, under the coordinate hypothesis we must have oCj.Vy = for 

all y. Letting Uy = 'tl (X^ — X), the test statistic can be written in terms 
of the hypothesized estimates ot^i'y of as 

h 

TnCH) = ^ fyvloL,^{oLl± aly^aluy. 

i=l 

5.1. Asymptotic distributions. A little setup is needed before we can de- 
scribe the asymptotic distribution of TniTC). Define the indicator variable 
Jy = 1 if y is in slice y and otherwise, let Py = S~ Cov(X, Jy) and let 

£y = Jy — fy — Py (X — E(X)) denote the population residual from the OLS 
fit of Jy on X. Let e be the hxl vector with elements £y, let D^ be the hx h 
diagonal matrix with gy on the diagonal and recall that a, the population 
version of a, is defined by (2). Finally, let Xi{D),X2{^)^ ■ ■ ■ jXk{D) denote 
independent chi-squared random variables, where the degrees of freedom D 
and K vary with context. 

Theorem 1. Assume that the linearity condition holds. Then, under the 
coordinate hypothesis P'hSy\7, = Op; y/n\ec{di Z„) converges in distribution 
to a normal random vector with mean and covariance matrix 

(12) n7^ = E(D^i££^D-i®a^ZZ^Q). 
Consequently, from (11), 

hr 
i=l 

where wi > 0^2 > • • • > ^hr ^i"^ ihe eigenvalues of fty^ . 

This theorem requires the linearity condition but not the coverage condi- 
tion. If the coverage condition fails so SIR estimates a subspace S of Syiz^ 
it provides a test of P-^S = Op, but we will necessarily miss part of the CS. 
If the coverage condition holds, then SIR estimates the whole CS and the 
theorem provides a test of the complete hypothesis P'H'Sy\z = Op- As dis- 
cussed later in Section 8, the test implied by this theorem might be useful 
even if the linearity condition fails. 

If we have conditions C1-C3, then fi^^ can be simplified. Let Qg = I — 
gg , where g denotes the hxl vector with elements gy. 

Corollary 1. // the linearity, coverage and constant covariance con- 
ditions hold, then 

(13) Qh = {Qg - tJ'^fJ') ® Ir 
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and 

i=i 
where 6i> ■ ■ ■ >6h = are the eigenvalues of Qg — /x"^/x. 

5.2. Implementation. The test statistic Tn(7i) is the same for ah versions 
of the test, but the reference distribution changes depending on conditions 
(C1)-(C3). In the most general case described in Theorem 1 we need to 
estimate the eigenvalues of the hr x hr covariance matrix Q.-j-i to construct 
the reference distribution. We can construct a consistent estimate fi-?^ of fi-^^ 
by substituting sample estimates for the unknown quantities: 

-y h riy 

(14) "h = - E E ^'a'^y^^lPt ® ""^Z^.Z^i", 

ft -, . ^ 



where a and Z^j are as defined previously and D^ is an /i x /i diagonal 
matrix with Qy on the diagonal. Also, kyj is the h x \ vector of the resid- 
uals for observation (yj), with one residual from each of the sample linear 
regressions of Jy on X. Letting lji denote the eigenvalues of J^t^, a p- value 
for the coordinate hypothesis can be constructed by comparing the observed 
value of Tn{T~i) to the percentage points of Yl,iLi "^ixK^)- There is a substan- 
tial literature on computing tail probabilities of the distribution of a linear 
combination of chi-squared random variables. See Field (1993) for an intro- 
duction. Alternatively, tail areas can usually be approximated adequately 
by using Satterthwaite's approximation. 

We can proceed similarly under conditions (C1)-(C3). The p-value can 
be found by comparing TniTi) to the percentage points of the distribution 
of X]i=i '^iX?('')) where 5i > • • • > 5/, = are the eigenvalues of 

(15) nn = {Qg-'Ll'Ln)®Ir. 

each with multiplicity r. 

For ease of reference, we refer to the test using the weighted chi-squared 
reference distribution constructed from (14) as the general test. The test 
using reference distribution constructed from (15) will be called the con- 
strained test. Both tests use the same statistic T„('H), but the reference 
distribution depends on applicable constraints, as given in Corollary 1. 

6. Conditional coordinate hypotheses. The conditional coordinate hy- 
pothesis P'hSy\7. = Op versus P'hSy\z 7^ Op given d might be useful when d 
is specified as a modeling device, or when inference on d using Tn{m) results 
in a clear estimate. A test statistic Tn{7i\d) can again be constructed as the 



12 R. D. COOK 

difference between the residual sum of squares under the null and alternative 
hypotheses: 

Tn{n\d)=L'^-Ld 

(16) =Tn{n)-{Tr,{d)-T'M)) 

d d 

(17) =n5]A,-n^A;., 

3=1 3=1 

where Tl^{d) = n YTj=d+i K ^^^ Tn{d) = n YTj=d+i '^j ^^^ ^^ defined in (5) and (9). 
Form (17) gives one way to compute the statistic and shows that it depends 
on the largest d eigenvalues of Qfi^Qf; and M for the null and alterna- 
tive hypotheses. In contrast, the usual SIR statistic Tn{m) depends on the 
smallest p — m eigenvalues of M. Form (16) will be easier to work with when 
developing the asymptotic distribution of Tn(TC\d) because it allows us to 
use some known results. To develop the asymptotic distribution of Tn(7i\d) 
we consider first the asymptotic distributions of Tl^{d) and Tn{d) — T!^{d) 
because these are components of Tn{7i\d) and may be of interest in their 
own right. For instance, Tl^{m) = L'^ — L' and thus it can be viewed as a 
test statistic for a dimension hypothesis given a coordinate constraint. 

6.1. Asymptotic distribution ofT^{d). The asymptotic distribution of 
T^((i) can be found by using results of Bura and Cook (2001a). Define 

Bura and Cook [(2001a), equations (8)-(13) and associated discussion] first 
used the general results of Eaton and Tyler (1994) on the asymptotic distri- 
bution of singular values of a random matrix to conclude that the asymptotic 
distribution of Tn{d) is the same as that of n||U„|p. They then established 
that 

^/nvec{Zn - /^) ^ Nph{0, A) 
and thus that 

(18) V^vec(U„) ^ N^p.d)ih-d) (0, i^^ ® rj') A(*o ® Tq)), 

where the hp x hp matrix A is as defined by Bura and Cook (2001a). It can 
be represented as an hx h array of p x p matrices A^s = Ipfs + (1 — '^fs)'^z\s 
and Ats = gtQsilp - ^z\t - ^z\s), where I^zis = Var(Z|y = s), s,t = I,. . . ,h. 
Thus 

{p-d){h~d) 
1=1 
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where wi > L1J2 > • • • > ^ip~d)(h~d) 3-^6 the eigenvalues of the covariance ma- 
trix in the asymptotic distribution of ■v/nvec(U„) given in (18). 

The asymptotic distribution of T^{d) can be found similarly. Define 



V^V',, = V^T^iQ-Zr, - Q„/i)* 







where the second equality follows because /x'^o = from the singular value 
decomposition (6). It follows from Eaton and Tyler (1994) that Tl^{d) and 
n||vec(U^)|P are asymptotically equivalent because, from Proposition 3, 
fj, and Q-hh have the same singular value decomposition. Now, 

V^vec(U;,,) = {Ih-d ® r^(3^)\/^vec(Z„*o)- 

Since vec(Z„,*o) -^ 0, it follows that i/nvec(Z„'J'o) converges in distribu- 
tion. Because Q^ converges in probability to Qi-c, it follows from Slutsky's 

theorem that we can replace TC with TC in ^/nvec(\J'^) without affecting its 
asymptotic distribution. Consequently, \/nvec(U^) is asymptotically equiv- 
alent to 

(4_d r]^ Qn)Vn yec{Zn^o) = Vnvec(r^g7^Z„*o) 

= V^vec(F„r^Z„*o) 

= {Ih-^d ^ Fn)Vnyec(Un), 

where the second equality follows from parts (iii) and (iv) of Proposition 2. 
Consequently, the asymptotic distribution of Tl^{d) is the same as that of 
n||(//j_rf (g) F->^) vec(Un)|p, which can be determined from the asymptotic 
distribution of y^vec(Un) given in (18). This enables us to conclude the 
following. 

Proposition 4. 

{p-d){h-d) 

where uji>uj2>---> oj(p-d){h-d) '^^^ ^^^ eigenvalues of 

n'^ = (*^ F^rJ) A(*o ® roF„). 

Additionally, the following corollary follows from Bura and Cook [(2001a), 
Theorem 2] and the fact that Ft.^ is a symmetric idempotent matrix of rank 
p — d — r [Proposition 2(iv)]. 

Corollary 2. Assume that the linearity, coverage and constant co- 
variance conditions hold. Then T^{d) is distributed asymptotically as a chi- 
squared random variable with [p — d — r){h — d — 1) degrees of freedom. 
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Given that P-}-iSy\z = Op, fifi and Qnt^t^ Qh have the same rank d. 
Consequently, we might expect Tn{d) — T^{d) to reflect little more than 
random variation. Consider the orthogonal decomposition 

nil vec(U„)f = 7i||(4_rf F^) vec(U„)f + n||(4_rf Gh) vec(U„)f, 

where G-^f is as defined in Proposition 2(v). As discussed previously in this 
section, the left-hand side is asymptotically equivalent to Tn{d) and the first 
term on the right-hand side is asymptotically equivalent to Tl^{d). Thus, the 
second term is asymptotically equivalent to T^ (d) — T^ (d) : 

(19) Tn{d)-Tl,{d)=n\\{h-d®G'H)vec{Vn)f + o.p{l). 

The next corollary gives the asymptotic distribution of Tn{d) — Tl^{d) under 
conditions C1-C3. Its proof parallels that of Corollary 2 and is omitted. 

Corollary 3. Assume that the linearity, coverage and constant covari- 
ance conditions hold. Then Tn{d) —T!^{d) is distributed asymptotically as a 
chi-squared random variable with r{h — d— 1) degrees of freedom. 

6.2. Asymptotic distribution of Tn(TC\d). The asymptotic distribution 
of Tn{TC\d) can be found under the coordinate hypothesis by using the fol- 
lowing proposition. The proof given in the Appendix relies on (19). 

Proposition 5. 

r„(W|d) = ||(*f0/,)V^vec(a^Z„)f + Op(l). 

Using this proposition in combination with Theorem 1 gives the following 
theorem. 

Theorem 2. Assume that the linearity and coverage conditions hold. 
Then, under the coordinate hypothesis P'hSy\z = Op, {^i ®/r)\/nvec(Q: Z„) 
converges in distribution to a normal random vector with mean and co- 
variance matrix 

Consequently, 

dr 



Tn{n\d)^Y.^,xl{i). 



c 

i=l 

where wi > cj2 > • • • > ^dr ^i"c the eigenvalues of fi-^i^. 
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It may be useful when reading this theorem to recah that r <p — d for a 
meaningful coordinate hypothesis. In particular, flfiu is not defined when 
d = p. 

As in Section 5, if conditions (C1)-(C3) hold, then fi-^u can be simplified: 

Corollary 4. // the linearity, coverage and constant covariance con- 
ditions hold then 

^n\d = {Id - Da) «> It 
and 

T4n\d)^Y.(^-X,)x]{r), 

where Xi > ■ ■ ■ > \d > are the nonzero eigenvalues of /x/x"^ and T)x is a 

diagonal matrix with diagonal elements Xj, j = 1, . . . ,d. 

The generalized inverse of fl-nid in Corollary 4 could be used to construct 
a Wald test statistic with an asymptotic chi-squared distribution under the 
coordinate hypothesis of Theorem 2. A similar comment applies to (13) 
under the coordinate hypothesis of Theorem 1. 

6.3. Implementation. The results of Theorem 2 can be implemented in a 
manner similar to the implementation of Theorem 1 described in Section 5.2. 
A consistent estimate ^i of ^i can be constructed from the singular value 
decomposition of Z„ just as ^i is obtained from the singular value de- 
composition of /x given in (6). A consistent estimate of ft-^i^^ can then be 
constructed as 

(20) n^|rf = (*f0/,)i7„(*i®/,,), 

where Cl-j-c is as given in (14). Similarly, the asymptotic reference distribution 
of Corollary 4 can be estimated by substituting the largest d eigenvalues 
Ai, . . . , Ad of M for Ai, . . . , Arf, which amounts to estimating ft-^^^. by using 

(21) nn\d = iId-J^x)^Ir, 

where D^^ is a diagonal matrix with diagonal elements Xj, j = 1, . . . ,d. 

Following the terminology for tests of marginal coordinate hypotheses, we 
refer to the test using reference distribution constructed from (20) as the 
general test. The constrained test uses the weighted chi-squared reference 
distribution based on (21). These two tests use the same statistic Tn{TC\d); 
only the reference distribution changes. 
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7. Simulation results and data analysis. Simulation studies were con- 
ducted to insure that the asymptotic tests behave as expected and to pro- 
vide a little insight about their operating characteristics. Each study was 
based on one of the following two models: 

(22) Y = Xi + e, 

P'' ^ = 0.5 + (^'+1.5)^ +^- 

The number of observations n, the number and distribution of the predictors 
X and the distributions of the errors e and 5 depend on the simulation. To 
avoid inadvertent tuning by choice of the number of slices, every simulation 
run used /i = 5 slices. Test results were tabulated over 1000 replications for 
each sampling configuration. 

7.1. Estimated versus nominal levels. In this section we report some rep- 
resentative results to compare estimated and nominal levels. The estimates 
were obtained by counting the number of p-values that were less than or 
equal to a nominal level in the 1000 replications for each sample configura- 
tion. These p-values were obtained by applying the tests to a predictor not 
represented in the mean function of the model, so r = 1. 

Estimated levels of all seven statistics described here are shown in Table 1 
for simulations from model (22) with five i.i.d. standard normal predictors, 
an independent normal error and various sample sizes. For instance, the 
estimated levels shown in subtable A are for the test statistic T^iTi) with 
its general reference distribution. The p- values were computed by comparing 
Tniji) to the quantiles of the weighted chi-squared distribution constructed 
by using the covariance matrix in (14). The results seem quite good for n = 
100 and 200. Tests based on an estimated weighted chi-squared distribution 
(subtables A-D) tend to be liberal. This conclusion held up throughout all 
the simulations of test level conducted. The performance of the chi-squared 
statistics (subtables E-G), which tended to be conservative, was similar to 
that reported by Bura and Cook (2001a) for T„(d). The statistics T!^{d) and 
Tn{d) — T^{d) were included in Table 1 to provide numerical support for the 
asymptotic calculations described previously. An investigation of possible 
roles for them in data analysis is outside the scope of this report. Discussion 
in the remainder of this section is confined to tests of the marginal and 
conditional coordinate hypotheses. 

A substantial increase in the number of predictors typically required that 
the sample size be increased to achieve consistent agreement between the 
estimated and nominal levels. Shown in Table 2 are estimated levels for 
the two general and two constrained tests based on model (23) with p = 10 
independent standard normal predictors. The agreement between the esti- 
mated and nominal levels for n = 400 and 800 seems quite good. Comparing 



TESTING PREDICTORS 17 

the results for Tn{TC) with those for Tn{TC\d) at n = 50, 100 suggests that 
tests based on Tn(Ti.\d) need somewhat larger sample sizes to achieve simi- 
lar agreement. This might be because use of T„,('H|(i) requires an estimate 
of ^i that is not required to use TniTL) [see (20)]. 

The two general tests, one for marginal coordinate hypotheses and one 
for conditional coordinate hypotheses, will probably be the most useful in 
practice since they require the fewest assumptions. In comparison, the cor- 

Table 1 

Estimated level of nominal 1,5,10 and 15% tests 

based on various statistics and reference 

distributions for model (22) unth p — 5 

independent standard normal predictors and 

e = 0.2N{0,l) 







Nominal level (%) 




n 


1 


5 


10 


15 




(A) 


r„CH) with hn (14) 




50 


2.8 


9.1 


16.9 


21.7 


100 


1.1 


6.1 


11.5 


18.8 


200 


1.0 


5.3 


11.4 


16.7 




(B) 


T„(H) with Cln (15) 




50 


2.1 


8.1 


15.2 


20.1 


100 


1.0 


5.6 


10.9 


17 


200 


0.9 


5.3 


10.3 


16.3 




(C) T„{H\d) with n«|d (20) 




50 


4.2 


10.2 


16.4 


23 


100 


2.4 


7.3 


12.3 


18.5 


200 


1.7 


5.3 


10.4 


14.9 




(D) T„{n\d) with n^id (21) 




50 


3.0 


8.5 


14.7 


20.6 


100 


1.9 


6.3 


12.2 


17.6 


200 


1.6 


5.2 

(E)r„(d) ^ 


9.9 

■X'(12) 


14.6 


50 


0.4 


4.6 


11 


17.2 


100 


0.9 


4.1 


9.1 


14.7 


200 


1.4 


4.9 

(F)T;(d). 


9.7 

-X'(9) 


14.1 


50 


0.5 


4.8 


10.3 


15.4 


100 


0.7 


4.2 


9.2 


14.8 


200 


1.0 


4.8 


9.3 


15.1 




(G) 


Tn{d)^T:, 


(d)~x'(3) 




50 


1.5 


5.3 


12.2 


17.9 


100 


0.9 


4.5 


9.1 


15.2 


200 


0.9 


4.9 


10.0 


16.0 
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Table 2 






Estimated levels from model (23) 


with 


P = 


10 independent N{0, 1) predicto' 


rs and 






5 = 0.2N{0,: 


L) 








Nominal level (%) 




n 


1 


5 


10 


15 




(A) Tn{H) with n 


■n (14) 




50 


3.3 


11.6 


22.8 


31.9 


100 


1.8 


7.8 


16.0 


21.1 


200 


2.2 


7.0 


13.0 


18.1 


400 


1.3 


4.8 


9.8 


15.1 


800 


1.4 


5.8 


10.3 


14.9 




(B) T„(H) with n 


h(15) 




50 


2.9 


9.8 


19.2 


29.5 


100 


1.3 


7.2 


14.2 


19.9 


200 


1.9 


6.9 


12.2 


17.7 


400 


1.2 


4.8 


10.0 


14.4 


800 


1.4 


5.9 


10.1 


14.8 




(C)T„ 


{n\d) with Q 


'm\d (20) 




50 


7.2 


17.7 


26.3 


31.1 


100 


4.1 


9.2 


15.5 


20.8 


200 


2.0 


8.0 


UA 


19.7 


400 


0.8 


5.1 


10.5 


15.0 


800 


0.8 


4.6 


10.5 


14.5 




(D)T„ 


{n\d) with n 


''H\d (21) 




50 


5.7 


15.3 


23.9 


30.2 


100 


3.2 


8.2 


14.9 


20.0 


200 


1.6 


7.6 


14.0 


19.1 


400 


0.9 


4.4 


10.2 


14.6 


800 


0.8 


4.9 


10.4 


14.5 



responding constrained tests achieved similar agreement between the esti- 
mated and nominal levels with somewhat smaller sample sizes. 

The results in Table 3 are intended to give some idea about the impact of 
the predictor distribution on the actual level of the two general tests. The 
subtables are designated as A and C to correspond to their designations in 
Tables 1 and 2. The simulation setup leading to Table 3 was repeated with 
other predictor distributions, including the t distribution with five degrees of 
freedom and the uniform (—2, 2) distribution. The results for these predictor 
distributions were quite similar to the results in Table 3. 

Over the range of simulations represented in this study it was observed 
that the estimated level of a nominal 1% test was nearly always between 
1 and 5% and the estimated level of a nominal 5% test was nearly always 
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Table 3 

Estimated test levels from m,odel (23) 

with 10 independent X^(4) predictors and 

5 = Q.2N{Q,l) 







Nominal level (%) 




n 


1 


5 


10 


15 




(A) Tn{H) with n 


!h (14) 




100 


2.5 


8.2 


13.4 


19.3 


200 


1.1 


5.4 


11.0 


15.7 


400 


1.2 


6.3 


11.5 


17.2 


800 


1.3 


5.3 


10.4 


15.4 



(C) Tn{H\d) with Un\d (20) 

100 1.6 7.2 12.7 18.7 

200 1.6 7.3 12.5 18.9 

400 0.7 3.2 7.9 12.9 

800 1.0 5.6 10.0 16.0 



between 5 and 10%. No simulations were conducted with more than 12 
predictors or more than 800 observations. 

7.2. Power. In this section we report results from a power study to gain 
insight into the operating characteristics of the proposed tests. It is not 
difficult to find examples where the power is near 1, the nominal level or 
anywhere between these extremes. To provide a benchmark for interpreta- 
tion, the standard linear model t-test was included in the study. 

The results reported in Table 4 are from model (22) with five independent 
standard normal predictors, n = 200 and three different errors e. For each 
model configuration, the power of the standard t-test for the hypothesis that 
the coefficient of Xi equals 0, and the power of the general marginal coordi- 
nate test for Xi , were estimated by computing the fraction of rejections in 
1000 replications. The first column of Table 4 indicates the test. The second 
column indicates the nature of the error and will be described shortly. The 
third and fourth columns give the estimated power (PR) at the nominal 1 
and 5% levels. The differences between the estimated and nominal levels for 
all tests in Table 4 were found to be roughly as those of Table 1 . 

To provide some information about estimation in addition to that for 
testing, we also computed the absolute sample correlations c between Xi 
and the fitted values from the OLS fit of 1" on X, including an intercept, 
and between Xi and the first SIR predictor. The 0.05, 0.5 and 0.95 quantiles 
co.osi Co. 5 and co.g of the empirical distributions of these absolute correlations 
are given in columns 5-7 of Table 4. 
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Table 4(A), e = crN^O, 1). For cr < 2 the two procedures were observed to 
yield essentially identical results. Both tests rejected in all 1000 replications, 
and the absolute correlations were all quite high. The results for cr = 1 are 
shown in the first two rows. The t-test was observed to be the clear winner 
for o" > 3; the results for a = 6.4 are shown in the third and fourth rows 
of this table. The qualitative nature of these results should perhaps not be 
surprising since the i-test has the home field advantage with a honioscedastic 
normal error. The estimated powers at 1 and 5% of the general conditional 
coordinate test T{7i\d = 1) were observed to be 0.275 and 0.469 for a = 6.4. 
Comparing these results with the corresponding results in the table suggests 
that a substantial part of the power differences between the t- and TniTi.)- 
tests can be attributed to the differential information on dimension. 

Table 4(B), e = 6A{x^{D) - D)/V2D. The scaling of this chi-squared 
error was chosen so that it has the same first two moments as the case with 
a = 6.4 in Table 4(A). As expected, the results for large D were essentially 
the same as those for a = 6.4 in Table 4(A). Results for D = 10 are shown in 
the first two rows. The corresponding estimated powers at 1 and 5% of the 
general conditional coordinate test T(7i\d= 1) were observed to be 0.348 
and 0.538. As illustrated in the third and fourth rows of this table, the 
performance of the marginal coordinate test is much better that the t-test 
when D is small. The corresponding estimated powers at 1 and 5% of the 

Table 4 
Power results based on model (22) with three different errors e 

Test PR@0.01 PR@0.05 co.os co.s co.95 



t 

Tn{n) 


cr = l 


(A) 
1 
1 


e = aN{0,l) 
1 
1 


0.977 
0.970 


0.992 
0.990 


0.998 
0.998 


t 
r„(H) 


a = 6.4 


0.359 
0.175 


0.583 
0.364 


0.346 
0.095 


0.765 
0.583 


0.951 
0.772 


t 
Tn{n) 


D = W 


(B)£ = 6.' 
0.374 
0.220 


i{x'{D)-D)/v 
0.609 
0.465 


^2D 
0.308 
0.120 


0.768 
0.698 


0.949 
0.948 


t 

T„CW) 


D = 2 


0.348 
0.797 


0.594 
0.928 


0.284 
0.605 


0.774 
0.895 


0.951 
0.976 


t 


r = 0.75 


{C)e: 
1 
1 


= (e^^i)iV(0,l) 
1 
1 


0.959 
0.954 


0.987 
0.985 


0.997 
0.997 


t 

T4n) 


r = 1.5 


0.508 
1 


0.630 
1 


0.177 
0.938 


0.817 
0.977 


0.980 
0.995 
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conditional coordinate test T{7i\d = 1) were observed to be 0.85 and 0.929. 
The results for the t- and T„('H)-tests were found to be similar for D around 
5 or 6. 

Table 4(C), e = (e^^)A^(0, 1). For r near this model is essentially the 
same as that for cr = 1 in Table 4(A), and the two tests were observed to be 
equivalent. However, with larger values of r the i-test begins to lose ground 
and for sufficiently large values the performance of the coordinate test is 
again much better than the i-test. Results for r = 0.75 and 1.5 are shown. 

The results of this section suggest that, while the coordinate tests might 
not perform as well as tests optimized for particular models, they perform 
reasonably across a wide range of regressions, particularly since they do not 
require a model for y|X. 

7.3. Choice of d. As illustrated in the power study of Section 7.2, T„('H|d) 
can be expected to have greater power than Tn{TC), and consequently there 
are potential gains from inferring about d prior to testing predictors. On 
the other hand, misspecification of d can lead to conclusions different from 
those based on the true value. In this section we describe qualitative results 
from a simulation study to investigate this behavior. Conclusions are based 
on the general marginal Tn{7i) and conditional Tn{7i\d) coordinate test of 
each of the individual predictors. 

Consider n = 200 observations from a regression with five independent 
standard normal predictors Xj and response Y = ^(Xi,X2) + e, where the 
standard normal error e_lLX. When ^ = Xi + e-^'^ , the marginal dimen- 
sion tests Tn{m) resulted in the correct conclusion that d = 2, and the five 
conditional tests Tn(Ti.\d = 2) correctly concluded that only Xi and X2 are 
relevant to the regression. The five marginal tests Tn{'H) reached the same 
conclusion. With d underspecified as 1, the five tests based on Tn{Ti.\d= 1) 
also resulted in the correct conclusion that only Xi and X2 are relevant. 
Underspecification did not affect the conclusions in this case because the 
first SIR direction was close to Span(ei + 62), where e^ denotes the 5x1 
vector with a 1 in the ith position and otherwise. In other words, both 
Xi and X2 were manifested in the first SIR direction, and so Tn{'H\d = 1) 
was able to detect contributions from both predictors. With d overspeci- 
fied as 3, Tn{TC\d = 3) resulted in the conclusion that Xi, X2 and X4 are 
significant, thus giving an upper bound on the set of relevant predictors. 

When fj, = Xi — X2 + e^^'^~^^^\ the marginal dimension tests again re- 
sulted in the correct conclusion that d = 2, and TniTild = 2) and Tn{TC) 
again correctly concluded that only Xi and X2 are relevant to the regres- 
sion. However, this time with d underspecified as 1, the test Tn(Ti.\d = 1) 
incorrectly concluded that only Xi is relevant. Underspecification affected 
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the conclusions in this case because X2 was not captured by the first SIR di- 
rection, which was close to Span(ei). With d overspecified as 3, Tn(Ti.\d = 3) 
again indicated three significant predictors, including Xi and X2. 

Results of this study, including results not reported here, suggest that 
misspecification of d need not be a worrisome issue when the marginal di- 
mension tests result in a clear estimate and that estimate is used in T{TC\d). 
When the value of d is not clear, it is still safe to base inference on the 
marginal coordinate test Tn{TC). 

7.4. Lean body mass regression. We revisit the lean body mass regres- 
sion [Cook and Weisberg (1999b)] to illustrate practical aspects of the previ- 
ous development. Lean body mass (LBM) is regressed on the logarithms of 
height (Ht), weight (Wt), sum of skin folds (SSF) and the logarithms of the 
five hematological variables red cell count (RCC), white cell count (WCC), 
plasma ferritin concentration (PFC), hematocrit (He) and hemoglobin (Hg) 
for 202 athletes at the Australian Institute of Sport. Logarithms of the eight 
predictors were used to help insure the linearity condition. Both females 
and males are represented in the data in approximately equal proportions. 
However, for this illustration we neglect gender in the regression. 

The SIR chi-squared p- values for the marginal dimension hypotheses d = 
m,m = 0,1,2,3, are about 0, 0, 0.13 and 0.46. Consequently, we initially 
inferred that d = 2, keeping in mind that d = 3 is also a possibility. The first 
two SIR directions f)^ and 7)2 are shown in the second and third columns of 
Table 5. The numbers in parentheses are the approximate standard errors 
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Fig. 1. Two scatterplots representing the SIR "fit" of the lean body mass regression: (a) 
LBM versus »)f X; (b) residuals versus J7^X. 
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Table 5 












Results from the 


lean body mass regression with 


all eight predictors 






Fit 






p- values 




X 


Vi 


V2 


T„(W) 


T„{'H\d = 


2) 


T„(W|d = 3) 


log[SSF] 


-0.158 (0.06) 


-0.076 (0.45) 













log[Wt] 


0.971 (0.22) 


-0.023 (1.6) 













log[Hg] 


0.140 (0.69) 


0.347 (5.3) 


0.830 


0.199 




0.369 


log[Ht] 


0.088 (0.65) 


-0.332 (5.0) 


0.344 


0.270 




0.537 


log[WCC] 


-0.007 (0.08) 


-0.015 (0.59) 


0.794 


0.650 




0.899 


log[RCCl 


0.011 (0.49) 


0.502 (3.8) 


0.090 


0.014 




0.032 


logpc] 


-0.073 (0.85) 


-0.715 (6.5) 


0.221 


0.021 




0.098 


log[PFC] 


0.003 (0.03) 


0.004 (0.25) 


0.040 


0.820 




0.192 



proposed by Chen and Li [(1998), page 297]. A scatterplot of LBM versus 
the first SIR predictor fji X is shown in Figure 1(a). The mean function in 
this plot is noticeably curved. Letting e denote the residuals from the OLS 
fit of LBM on (t)^ X, f)2 X), the need for a second direction is evident in a 3D 
plot of e versus {i)i Xj'^g ^)i which has a clear saddle shape. A scatterplot 
of e versus 7)2 X is shown in Figure 1(b). 

In the context of SDR there are now at least three options to aid in as- 
sessing the significance of the individual predictors to the regression. We 
might develop a model for LBM|X, guided by a 3D summary plot of LBM 
versus (t)^ X, 7)2 X) . Predictors could then be tested in the context of the 
resulting model. This type of procedure has produced useful results in the 
past, but there could be a worrisome possibility that the modeling process 
would effectively invalidate nominal characteristics of subsequent tests. An- 
other possibility is to follow the case study by Chen and Li [(1998), Section 
5.2] and use the approximate standard errors to guide variable selection. 
The assessment here is based on the general versions of the marginal and 
conditional coordinate tests. 

The last three columns of Table 5 give the p-values from the marginal 
Tn{TC)-test and the conditional tests Tn{7i\d = 2) and Tn(7i\d = 3) applied 
to each predictor in turn. We see from all three sets of tests that SSF and Wt 
contribute significantly to the regression, and probably RCC as well. The 
correlation between the first SIR predictor r)^ X based on the full data and 
the first SIR predictor from the regression of LBM on log(SSF),log(Wt) 
is about 0.9995, so these two identified predictors largely account for the 
shape of the plot in Figure 1(a). The correlation between the second SIR 
predictors from the same regressions is about 0.83. Evidently, SSF and Wt 
contribute significantly to the first two directions, while other predictors 
contribute mostly to the second direction. As in linear regression, two cor- 
related predictors might both have relatively large p- values, while deleting 
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Table 6 
Results from the lean body mass regression with four predictors 





Fit 










p- values 






X 


Vi 


V2 


V3 


Tr^iU) 


T„ 


.{■H\d = 


2) 


T„ 


.(W|d = 3) 


logpFC] 
log[RCCl 
log[SSF] 
log[Wt] 


0.010 
0.023 
-0.356 - 
0.934 - 


0.199 

0.556 

-0.592 

-0.549 


0.556 
-0.714 
-0.395 

0.159 


0.043 
0.004 








0.390 
0.039 










0.013 
0.001 







either causes the p-value for the remaining predictor to decrease substan- 
tiahy. Using Tniji) to test simultaneously the effects of the last six predic- 
tors in Table 5 yields a p- value of about 0.034, suggesting that some of those 
predictors also contribute to the regression. The tests Tn{T~i\d) with (i = 2, 3 
produced the same conclusion with similar p-values. 

The results so far can be partially summarized in terms of the hypothe- 
sis y_lL X2IX1, where X = (Xj^,X2 )• The tests gave firm indications that 
hypotheses with X2 = log(SSF) and X2 = log(Wt) are false. There was no 
notable information to reject hypotheses with X2 set to any combination of 
log(Ht), log(Hg) and log(WCC). Conclusions regarding the remaining three 
predictors were relatively ambiguous, depending on a dimension specifica- 
tion. One of the advantages of this type of analysis may be the ability to see 
which conclusions are firmly supported by the data without prespecifying a 
dimension for 5y|x and which depend on specification of a dimension and 
perhaps eventually a model. Nevertheless, to focus the analysis we deleted 
the three predictors that were judged to be unimportant and started over. 
The SIR chi-squared p- values from this regression for the hypotheses d = m, 
m = 0, 1, 2, 3, were about 0, 0, 0.010 and 0.31. Consequently, we now inferred 
that d = 3, a conclusion that remained stable for the rest of the analysis. This 
situation is consistent with the known propensity of the marginal dimension 
test to lose power when irrelevant predictors are added to the regression. 
Additionally, there was no notable evidence in this five-predictor regression 
to indicate that log(Hc) is relevant, leaving us with the reduced regression 
of LBM on the remaining four predictors (SSF, Wt, RCC, PFC). 

Additional results for the reduced regression are given in Table 6. The 
sample correlations between the first, second and third SIR predictors from 
the full regression and the corresponding predictors from the reduced re- 
gression are 0.9997, 0.93 and 0.98, suggesting that the two regressions are 
giving essentially the same information about y|X. The p- values in Table 6 
now give a consistent message for all predictors except PFC, which is judged 
nonsignificant when the dimension is under specified as 2. This result could 
be anticipated from the discussion of underspecification in Section 7.3, if 
substantial information on PFC is furnished by the third direction. This 
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interpretation is supported by the coefficients in Table 6, which were com- 
puted after marginally standardizing the predictors to have a sample stan- 
dard deviation of 1. Additionally, the absolute sample correlations between 
log(PFC) and the three SIR predictors of Table 6 are about 0.32, 0.21 and 
0.63. 

In analogy with linear regression, we could have proceeded more straight- 
forwardly by using backward elimination based on marginal or conditional 
tests to arrive at a reduced set of predictors. Starting with the marginal test 
in column 4, Table 5, and sequentially removing predictors whose p- values 
are larger than 0.05 yields the results given in the fifth column of Table 6. 
The same procedure based on the conditional test with d = 3 yields the re- 
sults in the last column of Table 6. The conditional test with d = 2 ends in 
with the same predictors, except PFC is excluded because d is underspeci- 
fied. 

8. Discussion. The theory of sufficient dimension reduction grew from 
a body of literature on how to graphically represent a regression in low di- 
mensions without loss of information on y |X. Much of the development was 
inspired by the idea of linear sufficient statistics developed from a paramet- 
ric view by Peters, Redner and Decell (1978) and Li's (1991) development 
of SIR. The primary motivation for the regression graphics ideas in Cook 
(1998a) stemmed from a desire to see how far graphics could be pushed in 
the analysis of regression data. The central subspace (CS) proved to be a 
key tool in that inquiry. The CS is intended to play a role similar to Li's 
(1991) EDR subspace, but it is a distinct population parameter constructed 
to insure that any nested sequence of dimension reduction subspaces always 
leads to the same population subspace. The methods developed here would 
not be possible using the EDR subspace because, in part, the fundamental 
equivalence of Proposition 1 would fail. 

Data analytic techniques (e.g., SIR, SAVE, PIR) for pursuing sufficient 
dimension reduction have mostly lived in a world apart from mainstream 
methodology, although there are threads leading to other ideas and methods 
[see Chen and Li (1998) for a discussion]. By outlining a general context for 
testing predictors and developing a specific implementation using SIR, this 
article moves the inferential capabilities of SDR a step closer to mainstream 
regression methodology. The connection with tradition is also strengthened 
by casting SIR in terms of nonlinear least squares. 

SIR has generated considerable interest since it was introduced. Hsing 
and Carroll (1992) develop a version of SIR in which each slice contains two 
observations so that the number of slices grows with the sample size. This 
two-slice method was extended by Zhu and Ng (1995) to allow for slices with 
more than two observations. The version in this article uses fixed slicing in 
which the number of observations per slice grows with the sample size. Zhu 
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and Fang (1996) bypass the slicing step and use kernel smoothing instead. 
Schott (1994) investigated inference methods for d when the predictors follow 
an elliptically contoured distribution. Elliptically contoured distributions are 
not required for the general methods in this article. 

Cook and Critchley (2000) showed that SDR methods can be useful 
for identifying outliers and regression mixtures. Assuming d to be known, 
Gather, Hilker and Becker (2001) developed a robust version of SIR by 
replacing its components (e.g., X, XI and M) with robust estimates. The 
nonlinear least squares formulation of SIR described in Section 3 allows for 
alternative robust versions of SIR that involve using a loss function other 
than least squares. 

The linearity condition (CI) and the coverage condition (C2) are the only 
two population conditions necessary for the theoretical justification of SIR. 
The constant covariance condition (C3) is used only to simplify the asymp- 
totic distribution of the test statistic under the null hypothesis. Letting 
M = Var(E(Z|y)), the role of the linearity and coverage conditions is to 
insure that Span(M) =Sy\z- Without these two conditions, the asymptotic 
distributions given in Theorems 1 and 2 remain valid if Sy\z is replaced 
by Span(M), but we may lose the equality Span(M) =Sy\z that provides 
an informative link with the population. As argued previously, the linear- 
ity condition need not be worrisome in practice, particularly if we use the 
adaptation methods discussed in Section 3.1. Li (1997) studied what can 
happen when the linearity condition fails, and Chen and Li (1998) devel- 
oped an interpretation of SIR that might be helpful in some applications 
when Span(M) /5y|z- 

APPENDIX: JUSTIFICATIONS 
A.l. Proposition 1. Let the columns of the matrix 77 be a basis for 5y|x- 

Then yXX|77^X if and only if Y _\L{Pn^,QH^)\{v^PH^ + v'^Qh^)- 
Now, 

PhSyi^ = Oj, =^ YMPH^,QH^WQn^ 

=^ YJLPn^^lQn^- 

For the reverse implication, Y ALPn^lQn^ implies 1" _1L (Pt-^X, Q-}iX.)\Q-}iX. 
and consequently Span{Q-}-i) is a dimension reduction subspace. Since the 
central subspace is assumed to exist, any dimension reduction subspace must 
contain the central subspace, which therefore must be in Span(<5-^). It fol- 
lows that Pn<SY\x = (^p- 

A. 2. Theorem 1. The yth column ^/n{a 1jn)y of ^/na Z„ can be ex- 
pressed as 



VTi{a^Zn)y = V^{ocl± ^a^y^/^a^gy± ^^^Z, 



y 
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Recalling that Jj, = 1 if y is in slice y and otherwise, and that f3 = 
S~ Cov(X, Jy), it seems straightforward to verify that /3„ = /j^S~ ' E(Z|y = 
y) with OLS estimator 

(24) ^y = fyt-'^^Zy. 

The linearity condition implies that /3 G Syix ^I'ld thus under the coordi- 
nate hypothesis OL^Py = 0. Consequently, y/na^jSy converges in distribution 
and we have 

Li, Cook and Chiaromonte (2003) provided a general expansion for OLS 
estimators that is applicable to (24). Using this we get 

n 

V^cxl{0y - (3y) = n"i/2^^5.-i/2 ^ 2iej^, + Oj,{n~^'^), 

i=l 

where 

£yi = Jyi ~ fy ~ Py O^i ~ E1(X)) 

is the population residual from the OLS regression of Jy on X. Thus, sub- 
stituting we get 

n 
i=\ 

where a = Yr^l'^OL.j.(c)L^^Yr^a.,j)~^l'^ as defined in (2). 
Next define the p x h matrix 



W„ — 2_^(Zjeij, . . . , ZjE/ij) — 2_^ ^i^i 



-T 

'i 1 



where Si is the /i x 1 vector with elements e^j, y = 1, . . . ,h. Define also the 
r X h matrix 

Then we have shown that 

V^vec(a^Z„) = n-^/2 vec(V„) + Opin-^/^) 

= (D-i a^)n-i/2 vec(W„) + Op{n-^/^) 



(Dg^ a^)n-i/2 ^ £ . Zi + Op(n-i/2). 



4 = 1 
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Because e contains OLS residuals, it is uncorrelated with any linear func- 
tion of X; in particular, Cov(e, Z) = 0. It follows that E(£ ® Z) = and there- 
fore, by the multivariate central limit theorem, n~^'^vec(W„) converges in 
distribution to a normal random vector with mean and covariance ma- 
trix Var(e(X'Z). Consequently, n~^'^vec(V„) converges in distribution to a 
normal random vector with mean and covariance matrix 

n-H = (D~^ ® OL^) Var(e ® Z)(D~^ ® ex) 

which is the desired conclusion. 

A.3. Corollary 1. 

A. 4. Equation (13). Under the linearity condition, e is a measurable 
function of Y and F^ Z, and 

nw = E[D-ie£^D-^®Q^E(ZZ^|(y,rfZ))Q]. 

The linearity and coverage conditions imply that E(ZZ |y, Ti Z) = E(ZZ |r;^ Z) 
and thus 

n-H = E[D^^e£^D^^ ® Q^E(ZZ^|rf Z)q]. 

The linearity and constant covariance conditions imply that 

E(ZZ'^|rf Z) = Var(Z|rf Z) + E(Z|rf Z)E(Z|rf Z)^ 

Consequently, 

Q^E(ZZ^|rf Z)q = ol^Qsy^^cx + OL^ Psy^J^T^^ Psy^'z.ol 

= a Qsy\-z.O'- 

= ol cx = Ij., 

where we have used the facts that under the coordinate hypothesis P^^ ct = 
and TC = Span(Q;) C Span(ro) (see Proposition 2). Thus, in summary to 
this point, 

(25) nH = E(D-iee^D-^)0l^. 

To simplify E(D~^£:e'^D~^), let J and f = E(J) denote the h x 1 vectors 
with elements Jy and fy, and write the residual vector as 

e = J-f-(/iDg)^Z. 
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In anticipation of expanding Fj{ee ) we have 

E(J - f)(J - f)^ = E(JJ^) - fF^ = D/ - ff"^, 
E(Z(J - f)^) = E(E(Z|y)J^) = fiBg. 



Then 



-T\ /'T» _ trT\ oT* ..T..T^ I T-w ..T. 



E{se' ) = (D/ - ff^ ) - 2Dg/x^ /xD^ + D^/^^ /xD^ 

3 9 



E(D-i££^D-i) = /, - gg^ - fi'^fi 



= Qg- At^At- 

A. 5. Asymptotic distribution. Since Var(Z) = Ip, we have 

Ip = E(Var(Z|y)) + Var(E(Z|y)) 
= E(Var(Z|y))+/x/i^ 

and consequently Ip — /x/x > 0, which imphes that the eigenvalues of fifi 
are between and 1. The nonzero eigenvalues of fifi^ are the same as the 
nonzero eigenvalues of fi^ fJ- and thus Ih — /^^/^ > 0. Combining this with the 
identity fiQg = /x we have Qg — n^ ^i = Qg{Ih — fJ-"^ tJ')Qg > 0. It follows that 
the eigenvalues of Qg — ^i^ ^i are nonnegative. The convergence in distribu- 
tion follows immediately because 5i > • • • > 5/j = 0, each with multiplicity r, 
are the eigenvalues of ^-h- 

A.6. Corollary 2. It follows from Bura and Cook [(2001a), Theorem 2 
and its justification; see also Cook (1998a), page 213] that, under the lin- 
earity, coverage and constant covariance conditions, 

{^l ® r^) A(*o ® To) = *;^Qg*o ® Ip-d- 

Thus, 

ft'a = ^lQg^o^Fn. 

The matrix Ft^ is a symmetric idempotent matrix of rank p — d — r [Propo- 
sition 2(iv)] and, from the discussion following Corollary 1, ^q Qg^o is a 
symmetric idempotent matrix of rank h — d— 1. Consequently, S7^ is a sym- 
metric idempotent matrix of rank (h — d — l){p — d — r) and the conclusion 
follows. 
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A. 7. Proposition 5. To find the limiting distribution of TniTild) under 
the coordinate hypothesis, we first use (19) and (16) to write 

(26) Tn{n\d) = TniU) - n||(4_d ® Gt^) vec(U„)f + Op(l) 
and, because ** = I/j, 

h rih 

^5] ||P^Z,f =ntrace(P^Z„**^Z^P^) 
y=lj=l 

= ntrace(P^Z„*i*f Z^P^) 

+ ntrace(r^P-Z„*o*ifZ^i='^r). 
The second term in this expression for Tn{7i) can be represented using 

The conclusion for the first coordinate relies on the fact that P^Cj^i converges 
to in probability under the hypothesis and y^vec(Z„*o) converges in dis- 
tribution. The conclusion for the second coordinate follows by an argument 
similar to that used in Section 6.1. Recalling that U„ = V^'En'^Q, we have 

ntrace(r^P^Z„*o*K^H^) =^trace(GwU„U^GH) +Op(l) 

= nil (/(,,_rf) Gw) vec(U„)f + Op(l). 
Combining this result with (26) we have 

T„(7^|d) =ntrace(P-Z„*i*fZ^P^) +Op(l) 

(27) =ntrace(«i^Z„*i*fZ^Q;)+Op(l) 



= ||(*^ 0l,)V^vec(a^Z„)r + Op(l). 

A. 8. Corollary 4. Under the linearity, coverage and constant covariance 
conditions, the covariance matrix of the asymptotic normal distribution of 
('J'f (8)/)-y/nvec(Q: Z„) can be found by using Corollary 1: 

n^lrf = (*f ® Ir)^H{^l ® It) 

= (*f ® Ir)iiQg - /x'^At) <S) /r)(*l ® Ir) 

= {Id-'Dx)^Ir, 

where Dx = D^ is the d x d matrix of the nonzero eigenvalues of /x /^. The 
third equality follows because 
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and thus 

The final equahty follows because ^iQg = /x and thus FiDs^j^ Qg = FiD^*! , 
which implies that ^i Qg = ^^ . 

The eigenvalues of ^7-^1^ are 1 — \j with multiplicity r, j = 1, . . . , d. Con- 
sequently, 

r„(W|d)^5:(l-A,)x|(r). 
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